The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Authors

  • Sriram Sankaran
  • Jeffrey M. Squyres
  • Brian W. Barrett
  • Vishal Sahay
  • Andrew Lumsdaine
  • Jason Duell
  • Paul Hargrove
  • Eric Roman
Abstract

As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.

Related resources

Hybrid Full/Incremental Checkpoint/Restart for MPI Jobs in HPC Environments

As the number of cores in high-performance computing environments keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes between checkpoints. We have designed a high-performance hybrid disk-based full/incremental checkpointing technique for MPI tasks to capture only data chan...

A Checkpoint and Restart Service Specification for Open MPI

HPC systems are growing in both complexity and size, increasing the opportunity for system failures. Checkpoint and restart techniques are one of many fault tolerance techniques developed for such adverse runtime conditions. Because of the variety of available approaches for checkpoint and restart, HPC system libraries, such as MPI, seeking to incorporate these techniques would benefit greatly ...

The Design and Implementation of Berkeley Lab’s Linux Checkpoint/Restart

Clusters of commodity computers running Linux are becoming an increasingly popular platform for high-performance computing, as they provide the best price/performance ratio in the marketplace. But while the size and raw power of Linux clusters continue to increase, many aspects of their software environments continue to lag behind those provided by proprietary supercomputing systems. One featur...

Automatic Resource-Centric Process Migration for MPI

Process migration refers to the ability to move a running process from one node and make it continue on another. The MPI standard prescribes support for process migration, but so far it was implemented mostly via checkpoint-restart. This paper presents an automatic and transparent process migration framework that can be used for MPI processes. This framework is advantageous when migration of in...

DMTCP Checkpoint/Restart of MPI Programs via Proxies

MPI accomplishes portable, standardized message-passing between processes by exposing a standard API that hides the implementation of the underlying mechanism for message passing. Until now, checkpointing an MPI program required knowledge of these underlying mechanisms. Through the addition of a proxy, we demonstrate that MPI programs can be checkpointed and restarted regardless of the MPI impl...

Journal:
  • IJHPCA

Volume 19

Publication date: 2005